{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "86221567",
   "metadata": {},
   "outputs": [],
   "source": [
    "%run -i \"../util/file_utils.ipynb\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "89d59aa6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "To Sherlock Holmes she is always _the_ woman. I have seldom heard him\n",
      "mention her under any other name. In his eyes she eclipses and\n",
      "predominates the whole of her sex. It was not that he felt any emotion\n",
      "akin to love for Irene Adler. All emotions, and that one particularly,\n",
      "were abhorrent to his cold, precise but admirably balanced mind. He\n",
      "was, I take it, the most perfect reasoning and observing machine that\n",
      "the world has seen, but as a lover he would have placed himself in a\n",
      "false position. He never spoke of the softer passions, save with a gibe\n",
      "and a sneer. They were admirable things for the observer—excellent for\n",
      "drawing the veil from men’s motives and actions. But for the trained\n",
      "reasoner to admit such intrusions into his own delicate and finely\n",
      "adjusted temperament was to introduce a distracting factor which might\n",
      "throw a doubt upon all his mental results. Grit in a sensitive\n",
      "instrument, or a crack in one of his own high-power lenses, would not\n",
      "be more disturbing than a strong emotion in a nature such as his. And\n",
      "yet there was but one woman to him, and that woman was the late Irene\n",
      "Adler, of dubious and questionable memory.\n"
     ]
    }
   ],
   "source": [
    "sherlock_holmes_part_of_text = read_text_file(\"../data/sherlock_holmes_1.txt\")\n",
    "print(sherlock_holmes_part_of_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6997b96",
   "metadata": {},
   "source": [
    "# Divide text into words using NLTK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "90a272e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', '.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', '.', 'All', 'emotions', ',', 'and', 'that', 'one', 'particularly', ',', 'were', 'abhorrent', 'to', 'his', 'cold', ',', 'precise', 'but', 'admirably', 'balanced', 'mind', '.', 'He', 'was', ',', 'I', 'take', 'it', ',', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', ',', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', '.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', ',', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', '.', 'They', 'were', 'admirable', 'things', 'for', 'the', 'observer—excellent', 'for', 'drawing', 'the', 'veil', 'from', 'men', '’', 's', 'motives', 'and', 'actions', '.', 'But', 'for', 'the', 'trained', 'reasoner', 'to', 'admit', 'such', 'intrusions', 'into', 'his', 'own', 'delicate', 'and', 'finely', 'adjusted', 'temperament', 'was', 'to', 'introduce', 'a', 'distracting', 'factor', 'which', 'might', 'throw', 'a', 'doubt', 'upon', 'all', 'his', 'mental', 'results', '.', 'Grit', 'in', 'a', 'sensitive', 'instrument', ',', 'or', 'a', 'crack', 'in', 'one', 'of', 'his', 'own', 'high-power', 'lenses', ',', 'would', 'not', 'be', 'more', 'disturbing', 'than', 'a', 'strong', 'emotion', 'in', 'a', 'nature', 'such', 'as', 'his', '.', 'And', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', ',', 'and', 'that', 'woman', 'was', 'the', 'late', 'Irene', 'Adler', ',', 'of', 'dubious', 'and', 'questionable', 'memory', '.']\n",
      "230\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "words_nltk = nltk.tokenize.word_tokenize(sherlock_holmes_part_of_text)\n",
    "print(words_nltk)\n",
    "print(len(words_nltk))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "435633c0",
   "metadata": {},
   "source": [
    "# Custom tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5cbfbee0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Last', 'night', 'I', 'went', 'for', 'dinner', 'in', 'an', 'Italian', 'restaurant.', 'The', 'pasta', 'was', 'delicious.']\n",
      "['I', 'went', 'out', 'to', 'a', 'dim_sum_dinner', 'last', 'night.', 'This', 'restaurant', 'has', 'the_best_dim_sum', 'in', 'town.']\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import MWETokenizer\n",
    "tokenizer = MWETokenizer([('dim', 'sum', 'dinner')])\n",
    "tokenizer.add_mwe(('the', 'best', 'dim', 'sum'))\n",
    "tokens = tokenizer.tokenize('Last night I went for dinner in an Italian restaurant. The pasta was delicious.'.split())\n",
    "print(tokens)\n",
    "tokens = tokenizer.tokenize('I went out to a dim sum dinner last night. This restaurant has the best dim sum in town.'.split())\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7a5a39f",
   "metadata": {},
   "source": [
    "# Divide text into words using spaCy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "053ae32f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_', 'the', '_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', '\\n', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', '\\n', 'predominates', 'the', 'whole', 'of', 'her', 'sex', '.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', '\\n', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', '.', 'All', 'emotions', ',', 'and', 'that', 'one', 'particularly', ',', '\\n', 'were', 'abhorrent', 'to', 'his', 'cold', ',', 'precise', 'but', 'admirably', 'balanced', 'mind', '.', 'He', '\\n', 'was', ',', 'I', 'take', 'it', ',', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', '\\n', 'the', 'world', 'has', 'seen', ',', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', '\\n', 'false', 'position', '.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', ',', 'save', 'with', 'a', 'gibe', '\\n', 'and', 'a', 'sneer', '.', 'They', 'were', 'admirable', 'things', 'for', 'the', 'observer', '—', 'excellent', 'for', '\\n', 'drawing', 'the', 'veil', 'from', 'men', '’s', 'motives', 'and', 'actions', '.', 'But', 'for', 'the', 'trained', '\\n', 'reasoner', 'to', 'admit', 'such', 'intrusions', 'into', 'his', 'own', 'delicate', 'and', 'finely', '\\n', 'adjusted', 'temperament', 'was', 'to', 'introduce', 'a', 'distracting', 'factor', 'which', 'might', '\\n', 'throw', 'a', 'doubt', 'upon', 'all', 'his', 'mental', 'results', '.', 'Grit', 'in', 'a', 'sensitive', '\\n', 'instrument', ',', 'or', 'a', 'crack', 'in', 'one', 'of', 'his', 'own', 'high', '-', 'power', 'lenses', ',', 'would', 'not', '\\n', 'be', 'more', 'disturbing', 'than', 'a', 'strong', 'emotion', 'in', 'a', 'nature', 'such', 'as', 'his', '.', 'And', '\\n', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', ',', 'and', 'that', 'woman', 'was', 'the', 'late', 'Irene', '\\n', 'Adler', ',', 'of', 'dubious', 'and', 'questionable', 'memory', '.']\n",
      "251\n"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "#!python -m spacy download en_core_web_sm # Run this line only the first time you run this notebook to download the model\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "doc = nlp(sherlock_holmes_part_of_text)\n",
    "words_spacy = [token.text for token in doc]\n",
    "print(words_spacy)\n",
    "print(len(words_spacy))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c0c1733",
   "metadata": {},
   "source": [
    "# Find the difference between the lists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "e864726e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'high', 'power', 'observer', '-', '_', '—', 'excellent', '’s', '\\n'}\n"
     ]
    }
   ],
   "source": [
    "print(set(words_spacy)-set(words_nltk))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d7ead4c",
   "metadata": {},
   "source": [
    "# Compare the time it takes using both methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "470f8bcb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NLTK: 0.0009722709655761719 s\n",
      "spaCy: 0.022734880447387695 s\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "def split_into_words_nltk(text):\n",
    "    words = nltk.tokenize.word_tokenize(sherlock_holmes_part_of_text)\n",
    "    return words\n",
    "\n",
    "def split_into_words_spacy(text):\n",
    "    doc = nlp(text)\n",
    "    words = [token.text for token in doc]\n",
    "    return words\n",
    "\n",
    "start = time.time()\n",
    "split_into_words_nltk(sherlock_holmes_part_of_text)\n",
    "print(f\"NLTK: {time.time() - start} s\")\n",
    "\n",
    "start = time.time()\n",
    "split_into_words_spacy(sherlock_holmes_part_of_text)\n",
    "print(f\"spaCy: {time.time() - start} s\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
